
(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 (#2913)

Merged

lishunyang12 merged 39 commits into vllm-project:main from baonudesifeizhai:omni2709 on May 9, 2026

Conversation

baonudesifeizhai (Contributor) commented Apr 19, 2026:


Purpose

#2709

This PR adds Phase 1 support for ModelOpt FP8 diffusion checkpoints.

  • Auto-detects quantization_config from diffusion checkpoint configs (a minimal sketch of this step follows the list).
  • Resolves generic fp8 stage configs to checkpoint-specific ModelOpt FP8 when serialized ModelOpt metadata is present.
  • Adds a ModelOpt FP8 checkpoint adapter for diffusers-style weight loading.
  • Extends HunyuanImage-3 ModelOpt FP8 loading for attention and MoE scalar scales.
  • Adds FP8 stage configs for supported image backbones.
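
A minimal sketch of the auto-detect step, assuming a diffusers-style layout where the transformer's config.json may carry a quantization_config stanza; the field names, helper name, and "modelopt_fp8" return value are illustrative assumptions, not this PR's actual code:

```python
# Hedged sketch only: resolve a generic "fp8" stage config to the
# checkpoint-specific ModelOpt FP8 path when serialized ModelOpt
# metadata is present in the checkpoint's transformer config.
import json
from pathlib import Path

def resolve_quantization(model_dir: str, requested: str | None) -> str | None:
    cfg_file = Path(model_dir) / "transformer" / "config.json"
    if not cfg_file.is_file():
        return requested
    quant_cfg = json.loads(cfg_file.read_text()).get("quantization_config")
    if not quant_cfg:
        return requested  # plain BF16 checkpoint; keep the stage config as-is
    algo = str(quant_cfg.get("quant_algo", "")).upper()
    if requested in (None, "fp8") and algo == "FP8":
        return "modelopt_fp8"  # checkpoint-specific ModelOpt FP8
    return requested
```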

Validation

Validated ModelOpt FP8 image generation on:

  • Flux
  • Flux2-Klein
  • Qwen-Image
  • HunyuanImage-3

Benchmark Setup

All results below use the following settings unless otherwise noted:

  • num-prompts=100
  • request-rate=inf
  • warmup-requests=0
  • width=1024
  • height=1024
  • num-inference-steps=20
  • seed=42

For online serving benchmarks, we use:

  • max-concurrency=32

Note:

  • Offline results are sequential offline benchmarks.
  • Online results are serving benchmarks under concurrency 32.
  • ModelOpt FP8 refers to pre-quantized offline checkpoints loaded through the ModelOpt checkpoint path.
  • For HunyuanImage3, the offline results were run on 4 GPUs while the online results were run on 2 GPUs, so only BF16 vs ModelOpt FP8 within the same mode should be compared directly.

BF16 vs ModelOpt FP8

| Model | Mode | BF16 Throughput (req/s) | ModelOpt FP8 Throughput (req/s) | BF16 Mean Latency (s) | ModelOpt FP8 Mean Latency (s) | BF16 Peak Mem (MB) | ModelOpt FP8 Peak Mem (MB) | ModelOpt FP8 vs BF16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanImage3 | Offline | 0.2362 | 0.2606 | 4.1992 | 3.8015 | N/A | N/A | +10.3% throughput, -9.5% latency |
| HunyuanImage3 | Online | 0.19 | 0.23 | 142.5196 | 119.0200 | 66526 | 65898 | +21.1% throughput, -16.5% latency, -0.9% peak memory |
| Qwen-Image-2512 | Offline | 0.2853 | 0.3139 | 3.4734 | 3.1535 | N/A | N/A | +10.0% throughput, -9.2% latency |
| Qwen-Image-2512 | Online | 0.29 | 0.31 | 93.7505 | 86.3423 | 59404 | 52764 | +6.9% throughput, -7.9% latency, -11.2% peak memory |
| Z-Image | Offline | 0.3717 | 0.3879 | 2.6621 | 2.5444 | N/A | N/A | +4.4% throughput, -4.4% latency |
| Z-Image | Online | 0.38 | 0.39 | 72.0314 | 68.8272 | 23852 | 22052 | +2.6% throughput, -4.4% latency, -7.5% peak memory |
| FLUX.2-dev | Offline | 0.1040 | 0.0913 | 9.5824 | 10.9228 | N/A | N/A | -12.2% throughput, +14.0% latency |
| FLUX.2-dev | Online | 0.21 | 0.18 | 131.8690 | 150.1522 | 87280 | 72304 | -14.3% throughput, +13.9% latency, -17.2% peak memory |
| FLUX.2-klein-4B | Offline | 1.5477 | 1.5823 | 0.6122 | 0.5942 | N/A | N/A | +2.2% throughput, -2.9% latency |
| FLUX.2-klein-4B | Online | 1.63 | 1.69 | 16.5748 | 16.0315 | 21418 | 19670 | +3.7% throughput, -3.3% latency, -8.2% peak memory |

Offline vs Online

| Model | Precision | Offline Throughput (req/s) | Online Throughput (req/s) | Offline Mean Latency (s) | Online Mean Latency (s) | Online Peak Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| HunyuanImage3 | BF16 | 0.2362 | 0.19 | 4.1992 | 142.5196 | 66526 |
| HunyuanImage3 | ModelOpt FP8 | 0.2606 | 0.23 | 3.8015 | 119.0200 | 65898 |
| Qwen-Image-2512 | BF16 | 0.2853 | 0.29 | 3.4734 | 93.7505 | 59404 |
| Qwen-Image-2512 | ModelOpt FP8 | 0.3139 | 0.31 | 3.1535 | 86.3423 | 52764 |
| Z-Image | BF16 | 0.3717 | 0.38 | 2.6621 | 72.0314 | 23852 |
| Z-Image | ModelOpt FP8 | 0.3879 | 0.39 | 2.5444 | 68.8272 | 22052 |
| FLUX.2-dev | BF16 | 0.1040 | 0.21 | 9.5824 | 131.8690 | 87280 |
| FLUX.2-dev | ModelOpt FP8 | 0.0913 | 0.18 | 10.9228 | 150.1522 | 72304 |
| FLUX.2-klein-4B | BF16 | 1.5477 | 1.63 | 0.6122 | 16.5748 | 21418 |
| FLUX.2-klein-4B | ModelOpt FP8 | 1.5823 | 1.69 | 0.5942 | 16.0315 | 19670 |

Observations

  • HunyuanImage3, Qwen-Image-2512, Z-Image, and FLUX.2-klein-4B all show consistent gains from ModelOpt FP8 in both offline and online settings.
  • FLUX.2-dev is the main exception in this set: ModelOpt FP8 reduces peak memory, but both offline and online throughput regress relative to BF16.
  • The largest online improvement in this batch is HunyuanImage3, with roughly 21% throughput gain and 16% mean latency reduction.

TODO

  • Investigate FLUX.2-dev and Qwen-Image latency.
  • The backend is currently forced to cutlass. For ModelOpt FP8 diffusion, each layer still follows: BF16 activation → FP8 activation quantization → FP8 GEMM → BF16 output (see the sketch below).
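
A minimal sketch of that per-layer dataflow, assuming static per-tensor scales and PyTorch's private torch._scaled_mm; this is illustrative only, not the cutlass kernel the backend actually dispatches to:

```python
import torch

# Hedged sketch: BF16 activation -> FP8 quantization -> FP8 GEMM -> BF16 out.
# Assumes an FP8-capable GPU and float32 scalar scale tensors loaded from the
# ModelOpt checkpoint.
def fp8_linear(x_bf16: torch.Tensor, w_fp8: torch.Tensor,
               w_scale: torch.Tensor, a_scale: torch.Tensor) -> torch.Tensor:
    # Quantize the activation; 448 is the E4M3 representable maximum.
    x_fp8 = (x_bf16 / a_scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    # torch._scaled_mm expects the second operand column-major, hence .t().
    return torch._scaled_mm(x_fp8, w_fp8.t(),
                            scale_a=a_scale, scale_b=w_scale,
                            out_dtype=torch.bfloat16)
```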

Test Plan

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux1-dev-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux_dit_2gpu_fp8.yaml \
  --prompt "a small red ceramic teapot on a wooden table, soft window light" \
  --height 512 \
  --width 512 \
  --num-inference-steps 2 \
  --seed 42 \
  --output outputs/flux_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux_modelopt_fp8.log

 CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-klein-4b-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux2_klein_dit_2gpu_fp8.yaml \
  --prompt "a cozy Tokyo cafe corner at night, warm tungsten lighting, rain on the window, ceramic coffee cup, highly detailed, cinematic photograph" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/flux2_klein_modelopt_fp8.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_klein_modelopt_fp8.log


ModelOpt FP8 quantization script for Qwen-Image:
https://paste.ubuntu.com/p/gby859n2Qt/

 CUDA_VISIBLE_DEVICES=0 \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_qwen_image_modelopt_fp8.py \
  --model /root/zdj/models/qwen-image \
  --output /root/zdj/models/qwen-image-modelopt-fp8 \
  --calib-size 8 \
  --calib-steps 8 \
  --height 512 \
  --width 512 \
  --overwrite \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_export.log
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a clean product photo of a blue enamel mug on a white desk, realistic lighting" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8.log

HunyuanImage-3 ModelOpt FP8 quantization script: https://paste.ubuntu.com/p/dTgpmNzw3K/

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_steps20.log

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic close-up photo of a glass greenhouse in a snowy mountain village at sunrise, warm golden light glowing through the windows, frost on the glass, pine trees, soft mist, ultra detailed, realistic photography" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.log


baonudesifeizhai (Contributor, Author) commented Apr 19, 2026:

FLUX.2-dev ModelOpt FP8 quantization script:
https://paste.ubuntu.com/p/Pkw5Wsjv4q/

 CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-dev-modelopt-fp8 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --prompt "a luxury art deco train dining car at golden hour, emerald velvet seats, brass lamps, rain streaks on the windows, cinematic wide angle photograph, highly detailed" \
  --guidance-scale 2.5 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --output outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.log

hsliuustc0106 (Collaborator) commented:

BLOCKING:

  • Test Coverage — No e2e online serving test. Please add a test that:
    1. Starts vllm serve <model> --omni
    2. Sends a generation request via the API
    3. Asserts the response contains a valid image

ModelOpt FP8 checkpoints should work in both Omni (offline) and vllm serve / AsyncOmni (online) modes before merging.
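
A hedged sketch of such a test, reusing the /v1/images/generations request shape from the curl commands later in this thread; the server URL, the model startup fixture, and the request parameters are placeholders:

```python
# Sketch only: assumes a `vllm serve <model> --omni` instance is already
# running at SERVER_URL (e.g. started by a pytest fixture).
import base64
import io

import requests
from PIL import Image

SERVER_URL = "http://127.0.0.1:8000"  # placeholder

def test_modelopt_fp8_images_api_returns_valid_image():
    resp = requests.post(
        f"{SERVER_URL}/v1/images/generations",
        json={
            "prompt": "a red teapot on a wooden table",
            "size": "512x512",
            "num_inference_steps": 4,
            "response_format": "b64_json",
        },
        timeout=600,
    )
    resp.raise_for_status()
    payload = resp.json()["data"][0]["b64_json"]
    img = Image.open(io.BytesIO(base64.b64decode(payload)))
    img.verify()  # raises if the bytes are not a decodable image
```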

lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…vllm-project#2920)

Threads quant_config / prefix through HunyuanVideo15Attention,
HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so
the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales.
Modulation, embeddings, proj_out stay raw nn.Linear (full precision).

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps,
  skips precision-sensitive layers (modulation, embeddings, output proj,
  token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers
  by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects
  ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
baonudesifeizhai (Contributor, Author) commented:

 CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/flux2-dev-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_online_server.log

Prompt: https://paste.ubuntu.com/p/ypkqDtNxQN/


lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
The default export_hf_checkpoint() doesn't actually serialize weights as FP8
for unknown model types like HunyuanVideo15Transformer3DModel — it saves
BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug.

Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight
  per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the
  quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so
  vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…block

When --weight-block-size 'M,N' is given, override the weight quantizer with
block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor
instead of a scalar. Patched config_groups advertises strategy='block' +
block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this
just works at serving time via vllm-project#2913's adapter.

Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128
to opt in.

Signed-off-by: lishunyang <lishunyang12@163.com>
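
For illustration, the override described above might look like the following quantizer attribute dict; only the block_sizes entry is quoted from the commit message, the surrounding keys are assumptions about ModelOpt's config layout:

```python
# Hedged sketch: per-block FP8 weight quantizer attributes. block_sizes uses
# negative axes counted from the end of the weight shape, so {-1: N, -2: M}
# yields an (out_features // M, in_features // N) scale tensor per linear.
M, N = 128, 128  # from --weight-block-size 128,128
weight_quantizer_cfg = {
    "num_bits": (4, 3),             # FP8 E4M3 (assumed key layout)
    "axis": None,
    "block_sizes": {-1: N, -2: M},  # quoted from the commit message above
}
```
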
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
baonudesifeizhai (Contributor, Author) commented Apr 20, 2026:

Z-Image:
For Z-Image ModelOpt FP8, the main caution is that not all linear layers are equally stable under FP8, so use a conservative quantization profile (a hedged sketch follows below).
Also preserve full transformer submodule prefixes during loading. Z-Image ignore-list matching depends on names like layers.*.attention.to_out, layers.*.feed_forward.w2, noise_refiner.*, and context_refiner.*; wrong prefixes can silently produce corrupted images.
https://paste.ubuntu.com/p/F8hdb5SMnY/
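
A hedged sketch of such a conservative profile using ModelOpt's mtq.quantize; the exact set of disabled patterns is an assumption modeled on the prefixes above:

```python
# Sketch only: disable FP8 for the layer groups called out above as
# precision-sensitive; pattern keys follow ModelOpt's quant_cfg wildcards.
import copy

import modelopt.torch.quantization as mtq

cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
for pattern in (
    "*noise_refiner*",
    "*context_refiner*",
    "*feed_forward.w2*",     # down-projections are the least FP8-stable
    "*attention.to_out*",
):
    cfg["quant_cfg"][pattern] = {"enable": False}

# model = mtq.quantize(model, cfg, forward_loop=run_calibration_prompts)
```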

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_z_image_base_modelopt_fp8.py \
  --model /root/zdj/models/z-image \
  --output /root/zdj/models/z-image-modelopt-fp8-conservative \
  --profile conservative \
  --calib-size 8 \
  --calib-steps 28 \
  --height 512 \
  --width 512 \
  --guidance-scale 4.0 \
  --overwrite \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_export.log

Offline:

 CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/z-image-modelopt-fp8-conservative \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --prompt "an Elden Ring style lone tarnished knight standing before a shattered cathedral under a dying golden tree, ruined stone arches, drifting ash, dramatic god rays, dark fantasy, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, watermark" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 28 \
  --seed 42 \
  --output outputs/z_image_modelopt_fp8_conservative_steps28.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_steps28.log
image
 CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/z-image-modelopt-fp8-conservative \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_online_server.log

 curl -s http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a Horus Heresy scene, a towering Space Marine in battered crusade-era power armor standing inside a ruined imperial cathedral during the age of civil war, shattered aquila banners, burning censers, broken stained glass, ash and embers drifting through the air, tragic gothic atmosphere, dramatic god rays, cinematic, ultra detailed",
    "negative_prompt": "blurry, low quality, distorted, deformed, watermark, extra limbs, bad anatomy",
    "size": "512x512",
    "num_inference_steps": 28,
    "guidance_scale": 4.0,
    "seed": 42,
    "response_format": "b64_json"
  }' \
  | jq -r '.data[0].b64_json' \
  | base64 -d \
  > outputs/z_image_modelopt_fp8_conservative_online_horus_heresy_steps28.png
image

baonudesifeizhai (Contributor, Author) commented Apr 20, 2026:

Online Qwen-Image serving:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/qwen-image-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_online_2gpu_server.log
curl -sS http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "/root/zdj/models/qwen-image-modelopt-fp8",
    "prompt": "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed, realistic fur texture",
    "negative_prompt": "blurry, low quality, distorted, deformed, oversaturated",
    "size": "512x512",
    "response_format": "b64_json",
    "n": 1,
    "num_inference_steps": 20,
    "true_cfg_scale": 4.0,
    "seed": 42
  }' \
  | tee outputs/qwen_image_modelopt_fp8_online_2gpu_response.json
python - <<'PY'
import base64, json
from pathlib import Path

payload = json.loads(Path("outputs/qwen_image_modelopt_fp8_online_2gpu_response.json").read_text())
Path("outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png").write_bytes(
    base64.b64decode(payload["data"][0]["b64_json"])
)
print("saved outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png")
PY


Offline:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a grimdark Warhammer 40,000 style hive city stretching into a poisoned orange sky, endless gothic megastructures, towering manufactorum spires, cathedral-like hab blocks, polluted atmosphere, flying gunships, crowds of tiny pilgrims and workers below, dramatic volumetric light, ash and smoke, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated, watermark, text" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.log

baonudesifeizhai (Contributor, Author) commented:

E2E test:

CUDA_VISIBLE_DEVICES=0,1 \
VLLM_TARGET_DEVICE=cuda \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_modelopt_fp8_image_serving.py::test_modelopt_fp8_images_api_returns_valid_image \
  -s \
  --run-level advanced_model \
  --tb=short

Result: passed

david6666666 (Collaborator) commented:

We should have a unified model-weight conversion script, like those in vllm-omni/vllm_omni/quantization/tools, plus compare_diffusion_trajectory_similarity scripts. WDYT @baonudesifeizhai @lishunyang12?

lishunyang12 (Collaborator) commented:

Quality outputs look good but we have no perf numbers for any of the 5 models. Can you share:

  • Latency + peak memory table for bf16 vs modelopt-fp8 (at least Flux + HunyuanImage-3)
  • Profiler trace per the profiling guide for one model — top-N kernels to confirm fp8 GEMM path is actually active and not silently falling back to bf16
  • List of layers that fell back to bf16 (skipped/unsupported) and why

Want to validate the perf story before merging.

baonudesifeizhai (Contributor, Author) commented Apr 24, 2026:

After forcing force_kernel=PerTensorTorchFP8ScaledMMLinearKernel on the vLLM side:

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  84.41
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.59
Latency Mean (s):                        22.9657
Latency Median (s):                      26.9953
Latency P99 (s):                         27.0379
Latency P95 (s):                         27.0342
--------------------------------------------------
Peak Memory Max (MB):                    65390.00
Peak Memory Mean (MB):                   65390.00
Peak Memory Median (MB):                 65390.00

vs 


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  92.37
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.54
Latency Mean (s):                        25.1176
Latency Median (s):                      29.5167
Latency P99 (s):                         29.5819
Latency P95 (s):                         29.5704
--------------------------------------------------
Peak Memory Max (MB):                    80366.00
Peak Memory Mean (MB):                   80366.00
Peak Memory Median (MB):                 80366.00

============================================================
Metrics saved to outputs/perf/flux2_dev_bf16_2gpu_c16_n50.json
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  18.59
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.69
Latency Mean (s):                        5.0586
Latency Median (s):                      5.9344
Latency P99 (s):                         5.9611
Latency P95 (s):                         5.9536
--------------------------------------------------
Peak Memory Max (MB):                    12758.00
Peak Memory Mean (MB):                   12758.00
Peak Memory Median (MB):                 12758.00
vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  19.69
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.54
Latency Mean (s):                        5.3560
Latency Median (s):                      6.2916
Latency P99 (s):                         6.3101
Latency P95 (s):                         6.3021
--------------------------------------------------
Peak Memory Max (MB):                    14506.00
Peak Memory Mean (MB):                   14506.00
Peak Memory Median (MB):                 14506.00

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  241.25
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.21
Latency Mean (s):                        65.6834
Latency Median (s):                      76.9524
Latency P99 (s):                         77.5409
Latency P95 (s):                         77.2703
--------------------------------------------------
Peak Memory Max (MB):                    96940.00
Peak Memory Mean (MB):                   96940.00
Peak Memory Median (MB):                 96940.00

============================================================

vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  282.36
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        76.8621
Latency Median (s):                      90.0141
Latency P99 (s):                         90.8956
Latency P95 (s):                         90.5725
--------------------------------------------------
Peak Memory Max (MB):                    135402.00
Peak Memory Mean (MB):                   135402.00
Peak Memory Median (MB):                 135402.00

============================================================
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  110.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.45
Latency Mean (s):                        29.9892
Latency Median (s):                      35.1732
Latency P99 (s):                         35.3604
Latency P95 (s):                         35.3243
--------------------------------------------------
Peak Memory Max (MB):                    39730.00
Peak Memory Mean (MB):                   39729.60
Peak Memory Median (MB):                 39730.00

============================================================
vs

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  102.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.49
Latency Mean (s):                        27.8345
Latency Median (s):                      32.6885
Latency P99 (s):                         32.7558
Latency P95 (s):                         32.7227
--------------------------------------------------
Peak Memory Max (MB):                    46464.00
Peak Memory Mean (MB):                   46464.00
Peak Memory Median (MB):                 46464.00

============================================================

baonudesifeizhai (Contributor, Author) commented:

https://paste.ubuntu.com/p/92yBc9x7bB/

curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A beautiful cinematic photo of a small red fox sitting in a snowy forest, ultra detailed, soft natural light"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("output.png","wb").write(base64.b64decode(m.group(1))); print("saved output.png")'

 curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A colossal warp monster emerging from a torn reality rift inside a gothic sci-fi battlefield, grimdark far future war aesthetic, twisted horns, glowing eyes, corrupted flesh, black armor fragments, chaotic purple and red energy, cathedral ruins, smoke, fire, cinematic lighting, ultra detailed, dramatic composition"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("warpspawn_40k_grimdark.png","wb").write(base64.b64decode(m.group(1))); print("saved warpspawn_40k_grimdark.png")'
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  254.98
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.20
Latency Mean (s):                        113.6107
Latency Median (s):                      129.1408
Latency P99 (s):                         180.3258
Latency P95 (s):                         180.2265
--------------------------------------------------
Peak Memory Max (MB):                    84630.00
Peak Memory Mean (MB):                   81597.36
Peak Memory Median (MB):                 84630.00
vs 


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  285.53
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        127.2317
Latency Median (s):                      144.4706
Latency P99 (s):                         202.5307
Latency P95 (s):                         202.4224
--------------------------------------------------
Peak Memory Max (MB):                    97340.00
Peak Memory Mean (MB):                   94306.96
Peak Memory Median (MB):                 97340.00

============================================================

baonudesifeizhai (Contributor, Author) commented:

cat >/tmp/modelopt_quality_cases.json <<'JSON'
[
  {
    "id": "qwen_image_2512_modelopt_fp8_dynamic_all",
    "baseline_model": "/root/zdj/models/qwen-image-2512",
    "quantized_model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "task": "t2i",
    "prompt": "a fox sitting in the snow in a forest, realistic photo",
    "max_lpips": 0.35,
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 20,
    "seed": 42,
    "negative_prompt": "blurry, low quality"
  }
]
JSON
export VLLM_OMNI_QUALITY_CONFIGS=/tmp/modelopt_quality_cases.json
export VLLM_OMNI_QUALITY_OUTPUT_DIR=/tmp/modelopt_quality_outputs

PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:${PYTHONPATH:-} \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/diffusion/quantization/test_quantization_quality.py \
  -v -m "" -k qwen_image_2512_modelopt_fp8_dynamic_all

tests/diffusion/quantization/test_quantization_quality.py::test_quantization_quality[qwen_image_2512_modelopt_fp8_dynamic_all] PASSED [100%]

Comment thread vllm_omni/model_executor/stage_configs/flux2_klein_dit_2gpu_fp8.yaml Outdated
Comment thread vllm_omni/quantization/factory.py Outdated
self.shared_experts = None

- self.experts = SharedFusedMoE(
+ self.experts = FusedMoE(
Collaborator:

We haven't validated this model yet, so we won't modify it for now.

from vllm.inputs import MultiModalDataDict
from vllm.logger import init_logger
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
Collaborator:

Why should this be changed to FusedMoE?

from vllm.entrypoints.pooling.embed.serving import ServingEmbedding as OpenAIServingEmbedding
from vllm.entrypoints.pooling.pooling.serving import OpenAIServingPooling
from vllm.entrypoints.pooling.score.serving import ServingScores
from vllm.entrypoints.pooling.pooling.serving import ServingPooling as OpenAIServingPooling
Collaborator:

Why should we modify it here?


import vllm.forward_context as _vllm_fc
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
Collaborator:

ditto

return rotary_position_embedding(x, cos, sin, rotated_mode="rotated_half", head_first=False, fused=True)


def _ensure_batch_dim(x: torch.Tensor) -> tuple[torch.Tensor, bool]:
Collaborator:

Why should we modify it here?

quantization="fp8",
task="t2i",
prompt="a cup of coffee on a wooden table, morning light",
max_lpips=0.35,
Collaborator:

I think the max_lpips threshold was set somewhat arbitrarily, and one metric isn't enough; we need to add metrics like PSNR or MAE to monitor quality. We should define this threshold properly first.
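
For reference, PSNR and MAE are cheap to compute alongside LPIPS; a minimal sketch, assuming both images arrive as float tensors in [0, 1] with identical shapes:

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    # Peak signal-to-noise ratio with peak value 1.0.
    mse = torch.mean((a - b) ** 2).clamp_min(1e-12)
    return float(10.0 * torch.log10(1.0 / mse))

def mae(a: torch.Tensor, b: torch.Tensor) -> float:
    # Mean absolute error over all pixels and channels.
    return float(torch.mean(torch.abs(a - b)))
```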

Comment thread tests/e2e/stage_configs/flux2_dev_dit_2gpu_fp8.yaml Outdated
@@ -0,0 +1,86 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator:

I think this test should include accuracy-related tests; simply testing functionality is meaningless.

Comment thread vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py Outdated
Comment thread vllm_omni/quantization/factory.py Outdated
baonudesifeizhai (Contributor, Author) commented:

CUDA_VISIBLE_DEVICES=2,1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:${PYTHONPATH:-} \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve \
  /root/zdj/models/z-image-modelopt-fp8-conservative \
  --omni \
  --host 0.0.0.0 \
  --port 8102 \
  --tensor-parallel-size 2 \
  --force-cutlass-fp8 \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_cli_cutlass_server.log

mkdir -p /root/zdj/vllm-omni/outputs

curl -sS http://127.0.0.1:8102/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/zdj/models/z-image-modelopt-fp8-conservative",
    "prompt": "grimdark far-future gothic sci-fi battlefield, a towering power-armored knight in black and crimson armor, cathedral ruins, burning incense, ash storm, massive gothic machinery, dramatic rim light, ultra detailed, cinematic, no text, no logo",
    "size": "1024x1024",
    "num_inference_steps": 20,
    "seed": 42,
    "negative_prompt": "blurry, low quality, distorted, deformed, oversaturated, text, logo"
  }' | /root/zdj/vllm/.venv/bin/python -c '
import sys, json, base64
r = json.load(sys.stdin)
open("/root/zdj/vllm-omni/outputs/z_image_modelopt_fp8_grimdark_40k_style.png", "wb").write(
    base64.b64decode(r["data"][0]["b64_json"])
)
'



CLI:

```bash
python text_to_image.py --model <your-model> --quantization fp8
```
Collaborator:

We should add modelopt.md and a .nav.yml entry, such as https://docs.vllm.ai/en/latest/features/quantization/modelopt/, following the vllm-omni quantization docs style.

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
Comment thread docs/user_guide/quantization/modelopt.md Outdated
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
david6666666 added the ready label (to trigger buildkite CI) on May 9, 2026
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
david6666666 (Collaborator) commented:

LGTM now. @lishunyang12 PTAL thx

@lishunyang12 lishunyang12 merged commit c4a0990 into vllm-project:main May 9, 2026
8 checks passed
david6666666 (Collaborator) commented:

We may need to add a modelopt quantization script tool later. Thank you for your contribution.

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
vllm-project#2709 (vllm-project#2913)

Signed-off-by: roG0d <rodgarcas98@gmail.com>
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Co-authored-by: roG0d <rodgarcas98@gmail.com>
fhfuih added a commit to fhfuih/vllm-omni that referenced this pull request May 15, 2026
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Gaohan123 pushed a commit that referenced this pull request May 15, 2026
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
tzhouam pushed a commit that referenced this pull request May 15, 2026
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>